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Abstract 

Deep neural networks currently demonstrate state-of-the-art performance in sev¬ 
eral domains. At the same time, models of this class are very demanding in terms 
of computational resources. In particular, a large amount of memory is required 
by commonly used fully-connected layers, making it hard to use the models on 
low-end devices and stopping the further increase of the model size. In this paper 
we convert the dense weight matrices of the fully-connected layers to the Tensor 
Train format such that the number of parameters is reduced by a huge factor 
and at the same time the expressive power of the layer is preserved. In particular, 
for the Very Deep VGG networks 1 I 21 I 1 we report the compression factor of the 
dense weight matrix of a fully-connected layer up to 200000 times leading to the 
compression factor of the whole network up to 7 times. 


1 Introduction 

Deep neural networks currently demonstrate state-of-the-art performance in many domains of large- 
scale machine learning, such as computer vision, speech recognition, text processing, etc. These 
advances have become possible because of algorithmic advances, large amounts of available data, 
and modern hardware. For example, convolutional neural networks (CNNs) ifollPIl] show by a large 
margin superior performance on the task of image classihcation. These models have thousands of 
nodes and millions of learnable parameters and are trained using millions of images |[T9|] on powerful 
Graphics Processing Units (GPUs). 

The necessity of expensive hardware and long processing time are the factors that complicate the 
application of such models on conventional desktops and portable devices. Consequently, a large 
number of works tried to reduce both hardware requirements (e. g. memory demands) and running 
times (see Sec.|2]). 

In this paper we consider probably the most frequently used layer of the neural networks: the fully- 
connected layer. This layer consists in a linear transformation of a high-dimensional input signal to a 
high-dimensional output signal with a large dense matrix defining the transformation. For example, 
in modern CNNs the dimensions of the input and output signals of the fully-connected layers are 
of the order of thousands, bringing the number of parameters of the fully-connected layers up to 
millions. 

We use a compact multiliniear format - Tensor-Train (TT-format) ijT^ - to represent the dense 
weight matrix of the fully-connected layers using few parameters while keeping enough flexibil¬ 
ity to perform signal transformations. The resulting layer is compatible with the existing training 
algorithms for neural networks because all the derivatives required by the back-propagation algo¬ 
rithm ifT^ can be computed using the properties of the TT-format. We call the resulting layer a 
TT-layer and refer to a network with one or more TT-layers as TensorNet. 

We apply our method to popular network architectures proposed for several datasets of different 
scales: MNIST jTsIl . CIFAR-10 ifT^ . ImageNet |0]. We experimentally show that the networks 
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with the TT-layers match the performance of their uncompressed counterparts but require up to 
200 000 times less of parameters, decreasing the size of the whole network by a factor of 7. 

The rest of the paper is organized as follows. We start with a review of the related work in Sec. |2] 
We introduce necessary notation and review the Tensor Train (TT) format in Sec. [3 In Sec.m 
we apply the TT-format to the weight matrix of a fully-connected layer and in Sec. |5] derive all 
the equations necessary for applying the back-propagation algorithm. In Sec. |6] we present the 
experimental evaluation of our ideas followed by a discussion in Sec.|2l 

2 Related work 

With sufficient amount of training data, big models usually outperform smaller ones. However state- 
of-the-art neural networks reached the hardware limits both in terms the computational power and 
the memory. 

In particular, modern networks reached the memory limit with 89% or even 100% memory 
occupied by the weights of the fully-connected layers so it is not surprising that numerous attempts 
have been made to make the fully-connected layers more compact. One of the most straightforward 
approaches is to use a low-rank representation of the weight matrices. Recent studies show that 
the weight matrix of the fully-connected layer is highly redundant and by restricting its matrix rank 
it is possible to greatly reduce the number of parameters without significant drop in the predictive 
accuracy llEESHH]. 

An alternative approach to the problem of model compression is to tie random subsets of weights 
using special hashing techniques (|4[]. The authors reported the compression factor of 8 for a two¬ 
layered network on the MNIST dataset without loss of accuracy. Memory consumption can also be 
reduced by using lower numerical precision lH or allowing fewer possible carefully chosen param¬ 
eter values 0. 

In our paper we generalize the low-rank ideas. Instead of searching for low-rank approximation of 
the weight matrix we treat it as multi-dimensional tensor and apply the Tensor Train decomposition 
algorithm fl3] . This framework has already been successfully applied to several data-processing 
tasks, e. g. lll6llz^ . 

Another possible advant^e of our approach is the ability to use more hidden units than was available 
before. A recent work 121 shows that it is possible to construct wide and shallow (i. e. not deep) 
neural networks with performance close to the state-of-the-art deep CNNs by training a shallow 
network on the outputs of a trained deep network. They report the improvement of performance 
with the increase of the layer size and used up to 30 000 hidden units while restricting the matrix 
rank of the weight matrix in order to be able to keep and to update it during the training. Restricting 
the TT-ranks of the weight matrix (in contrast to the matrix rank) allows to use much wider layers 
potentially leading to the greater expressive power of the model. We demonstrate this effect by 
training a very wide model (262 144 hidden units) on the CIFAR-10 dataset that outperforms other 
non-convolutional networks. 

Matrix and tensor decompositions were recently used to speed im the inference time of CNNs |0, 
0. While we focus on fully-connected layers, Lebedev et al. 10 used the CP-decomposition to 
compress a 4-dimensional convolution kernel and then used the properties of the decomposition to 
speed up the inference time. This work shares the same spirit with our method and the approaches 
can be readily combined. 

Gilboa et al. exploit the properties of the Kronecker product of matrices to perform fast matrix-by- 
vector multiplication j^]. These matrices have the same structure as TT-matrices with unit TT-ranks. 

Compared to the Tucker format ||0 and the canonical format (21, the TT-format is immune to 
the curse of dimensionality and its algorithms are robust. Compared to the Hierarchical Tucker 
format ifTTI] . TT is quite similar but has simpler algorithms for basic operations. 

3 TT-format 

Throughout this paper we work with arrays of different dimensionality. We refer to the one¬ 
dimensional arrays as vectors, the two-dimensional arrays - matrices, the arrays of higher dimen¬ 
sions - tensors. Bold lower case letters (e. g. a) denote vectors, ordinary lower case letters (e. g. 
a{i) = Oi) - vector elements, bold upper case letters (e. g. A) - matrices, ordinary upper case letters 
(e. g. A{i, j)) - matrix elements, calligraphic bold upper case letters (e. g. .A) - for tensors and 
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ordinary calligraphic upper case letters (e. g. A{i) = A{ii ,..., id)) - tensor elements, where d is 
the dimensionality of the tensor A. 

We will call arrays explicit to highlight cases when they are stored explicitly, i. e. by enumeration of 
all the elements. 

A d-dimensional array (tensor) A is said to be represented in the TT-format lIT^ if for each dimen¬ 
sion k = 1,...,d and for each possible value of the fc-th dimension index jk = 1,... ,nk there 
exists a matrix Gk[jk] such that all the elements of A can be computed as the following matrix 
product: 

A{ji, ■ ■ ■ ,jd) = Gi[ji\G2[j2\ • • • Gd[jd\- (1) 

All the matrices Gk\jk\ related to the same dimension k are restricted to be of the same 
size Ik -1 X ffe- The values I'o and id equal 1 in order to keep the matrix product ([T]i of size 1 x 1. In 
what follows we refer to the representation of a tensor in the TT-format as the TT-representation or 
the TT-decomposition. The sequence is referred to as the TT-mnks of the TT-representation 

of A (or the ranks for short), its maximum - as the maximal TT-rank of the TT-representation 
of .4.: r = maxfc=o,....d ifc- The collections of the matrices corresponding to the same 

dimension (technically, 3-dimensional arrays Qk) are called the cores. 

Oseledets IUtI Th. 2.1] shows that for an arbitrary tensor A a TT-representation exists but is not 
unique. The ranks among different TT-representations can vary and it’s natural to seek a representa¬ 
tion with the lowest ranks. 

We use the symbols Gk[jk]{cik-i,ctk) to denote the element of the matrix Gk[jk] in the position 
(afc_i, afc), where ak-i = 1,..., r^-i, ak = 1,..., r^. Equation ([T]i can be equivalently rewritten 
as the sum of the products of the elements of the cores: 

-4(ji,... ,jd) = ^Gi[ji](Q!o,Q!i) • ■ .Gd[jd]{ad-i,ad). (2) 

ao,...,aci 

The representation of a tensor A via the explicit enumeration of all its elements requires to store 
Ylk=i numbers compared with Yl‘k=i numbers if the tensor is stored in the TT-format. 

Thus, the TT-format is very efficient in terms of memory if the ranks are small. 

An attractive property of the TT-decomposition is the ability to efficiently perform several types 
of operations on tensors if they are in the TT-format: basic linear algebra operations, such as the 
addition of a constant and the multiplication by a constant, the summation and the entrywise product 
of tensors (the results of these operations are tensors in the TT-format generally with the increased 
ranks); computation of global characteristics of a tensor, such as the sum of all elements and the 
Frobenius norm. See IT^ for a detailed description of all the supported operations. 

3.1 TT-representations for vectors and matrices 

The direct application of the TT-decomposition to a matrix (2-dimensional tensor) coincides with 
the low-rank matrix format and the direct TT-decomposition of a vector is equivalent to explicitly 
storing its elements. To be able to efficiently work with large vectors and matrices the TT-format 
for them is dehned in a special manner. Consider a vector b G where N = 0^=1 ^fc- We 
can establish a bijection p between the coordinate ^ G {1,..., N} of b and a d-dimensional vector- 
index p{£) = ..., p-di^)) of the corresponding tensor B, where pk{(-) C {1) ■ • ■, ^k}. The 

tensor B is then dehned by the corresponding vector elements: i3(/r(f)) = bg. Building a TT- 
representation of B allows us to establish a compact format for the vector b. We refer to it as a 
TT-vector. 

Now we dehne a TT-representation of a matrix W G where M = I\k=i and 

W = Let bijections iz(t) = ..., i/d(f)) and/r(£) = (pi(£),...,map 

row and column indices t and i of the matrix W to the d-dimensional vector-indices whose fc-th 
dimensions are of length mk and Uk respectively, k = 1,... ,d. From the matrix W we can form 
a d-dimensional tensor W whose fc-th dimension is of length rrikUk and is indexed by the tuple 
(^fc(f), /Xfc(£)). The tensor W can then be converted into the TT-format: 

W(t,C) = ..., {vd{t), Pd{i))) = Gi[i2i{t), pi{i)]... Gd[vd{t), Pd{i)], (3) 

where the matrices Gk[r'kit), pki^)], k = 1,... ,d, serve as the cores with tuple {vk{t), pk{()) 
being an index. Note that a matrix in the TT-format is not restricted to be square. Although index- 
vectors vit) and ij.{£) are of the same length d, the sizes of the domains of the dimensions can vary. 
We call a matrix in the TT-format a TT-matrix. 
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All operations available for the TT-tensors are applicable to the TT-vectors and the TT-matrices as 
well (for example one can efficiently sum two TT-matrices and get the result in the TT-format). Ad¬ 
ditionally, the TT-format allows to efficiently perform the matrix-by-vector (matrix-by-matrix) prod¬ 
uct. If only one of the operands is in the TT-format, the result would be an explicit vector (matrix); if 
both operands are in the TT-format, the operation would be even more efficient and the result would 
be given in the TT-format as well (generally with the increased ranks). For the case of the TT-matrix- 
by-explicit-vector product c = Wb, the computational complexity is 0{dr^ mmax{M, N}), 
where d is the number of the cores of the TT-matrix W, m = max^^i ... m^, r is the maximal 

rank and N = 0^=1 *^he length of the vector b. 

The ranks and, correspondingly, the efficiency of the TT-format for a vector (matrix) depend on the 
choice of the mapping (mappings u{t) and /x((')) between vector (matrix) elements and the un¬ 
derlying tensor elements. In what follows we use a column-major MATLAB reshape commandQ 
to form a d-dimensional tensor from the data (e. g. from a multichannel image), but one can choose 
a different mapping. 

4 TT-layer 

In this section we introduce the TT-layer of a neural network. In short, the TT-layer is a fully- 
connected layer with the weight matrix stored in the TT-format. We will refer to a neural network 
with one or more TT-layers as TensorNet. 

Fully-connected layers apply a linear transformation to an A^-dimensional input vector x: 

y = Wx + b, (4) 

where the weight matrix W € vector b S define the transformation. 

A TT-layer consists in storing the weights W of the fully-connected layer in the TT-format, allowing 
to use hundreds of thousands (or even millions) of hidden units while having moderate number of 
parameters. To control the number of parameters one can vary the number of hidden units as well 
as the TT-ranks of the weight matrix. 

A TT-layer transforms a d-dimensional tensor X (formed from the corresponding vector x) to the d- 
dimensional tensor ^ (which correspond to the output vector y). We assume that the weight matrix 
W is represented in the TT-format with the cores Gk\ik:jk\- The linear transformation (HJl of a 
fully-connected layer can be expressed in the tensor form: 

..,id)= ^ Gi[ii,ji]... Gd[id,jd] X{ji, ...,jd)+ B{ii,.. .,id)- (5) 

Direct application of the TT-matrix-by-vector operation for the Eq. (|5]) yields the computational 
complexity of the forward pass 0{dr^m max{m, n}'^) = 0{dr^m max{M, N}). 

5 Learning 

Neural networks are usually trained with the stochastic grad ient descent algorithm where the gradi¬ 
ent is computed using the back-propagation procedure IT^ . Back-propagation allows to compute 
the gradient of a loss-function L with respect to all the parameters of the network. The method starts 
with the computation of the gradient of L w.r.t. the output of the last layer and proceeds sequentially 
through the layers in the reversed order while computing the gradient w.r.t. the parameters and the 
input of the layer making use of the gradients computed earlier. Applied to the fully-connected lay¬ 
ers (01 the back-propagation method computes the gradients w.r.t. the input x and the parameters 
W and b given the gradients ^ w.r.t to the output y: 

dL ^ dL dL dL ^ dL dL 
dx dy ’ dW dy ’ db dy 

In what follows we derive the gradients required to use the back-propagation algorithm with the TT- 
layer. To compute the gradient of the loss function w.r.t. the bias vector b and w.r.t. the input vector 
X one can use equations (|6ll. The latter can be applied using the matrix-by-vector product (where the 
matrix is in the TT-format) with the complexity of 0{dr^n max{TO, n}^) = 0{dr^n max{M, W}). 

’http://WWW.mathworks.com/heIp/matlab/ref/reshape.html 
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Operation 

Time 

Memory 

FC forward pass 

0{MN) 

0{MN) 

TT forward pass 

0{dr‘^muiax{M, N}) 

0{r max{M, W}) 

FC backward pass 

0{MN) 

0{MN) 

TT backward pass 

0{d^ m max{M, N}) 

0(r3 rrrax{M, N}) 


Table 1: Comparison of the asymptotic complexity and memory usage of an M x N TT-layer and 
an M X N fully-connected layer (FC). The input and output tensor shapes are mi x ... x md and 
ni X ... X rid respectively {m = d f^k) r is the maximal TT-rank. 

To perform a step of stochastic gradient descent one can use equation ® to compute the gradient 
of the loss function w.r.t. the weight matrix W, convert the gradient matrix into the TT-format 
(with the TT-SVD algorithm 03]) and then add this gradient (multiplied by a step size) to the 
current estimate of the weight matrix: Wk+i = VFfc + lk-§^- However, the direct computation of 
requires 0{MN) memory. A better way to learn the TensorNet parameters is to compute the 
gradient of the loss function directly w.r.t. the cores of the TT-representation of W. 

In what follows we use shortened notation for prefix and postfix sequences of indices: := 

{ii,... ik ■= (ik+i, ■ ■ ■ ^id), i = We also introduce notations for partial 

core products: 

^k ’Jfc ] •“ Jl] ■ • ■ 

^k ^k+l['ik+l! jk+l] ■ ■ ■ Gd[idT jd]- 

We now rewrite the definition of the TT-layer transformation Q for any k = 2,... ,d — 1: 

y{i) =y(.ik,ik,it) = PkiH’jk]Gk[ik,jk]P^[it,3k]^{jk,jk,j^)+B{i). (8) 

3k >3k,jk 


The gradient of the loss function L w.r.t. to the /c-th core in the position [ik,jk] can be computed 
using the chain rule: 

dL ^ ^ dL dyji)^ 

dGk[ik,jk] i dy{i) dGk[ik,jk] 

rfc_i X Tfe 

Given the gradient matrices the summation (|3l can be done explicitly in 0{M ik-i I'fc) 

uCrk [f-k :Jk\ 

time, where M is the length of the output vector y. 

We now show how to compute the matrix for any values of the core index k S {1,..., d} 

C'Crfc [^k ■:3k\ 

and ik € m^}, jk £ {1,...,n^}. For any i = {ii,... ,id) such that ik ^ ik the value 

of y{i) doesn’t depend on the elements of Gk [ik , jk] making the corresponding gradient j j 

equal zero. Similarly, any summand in the Eq. ® such that jk ^ jk doesn’t affect the gradient 
—These observations allow us to consider only ik = ik and jk = jk- 

C/Crfe [Ik ijk\ 

is a linear function of the core Gk[ik:jk] its gradient equals the following expres¬ 
sion: 


9y{ik^^k,it) 

dGk[ik-; jk] 


3k -3t 


[*/c ^jk ]) 

rfc-l Xl 


(PkiitJkd'^UkJkJi). 


Ixrfc 


( 10 ) 


We denote the partial sum vector as Rk [jf. ,jk,ik] ^ • 

P-k[jl I • ■ • I Jfe-r J Jfci ik+l: ■ ■ ■ jid\ = Rk [jk I ifci ^ ^ Rk i'^k ’3 k] ^Uk ijkijk)' 

3t 

Vectors Rk[jk , jk^i'k] for all the possible values of k, j^, jk and can be computed via dy¬ 
namic programming (by pushing sums w.r.t. each jk+i, ■ ■ ■ ,jd inside the equation and summing 
out one index at a time) in 0{dr'^mmax{M, N}). Substituting these vectors into ( fTOb and using 
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- 32 X 32 

- 4 X 8 X 8 X 4 

- 4x4x4x4x4 

- 2x2x8x8x2x2 

- 2^° 

— matrix rank 
• uncompressed 

10^ lo'* 10'* 10** 10** 

number of parameters in the weight matrix of the first layer 

Figure 1: The experiment on the MNIST dataset. We use a two-layered neural network and substitute 
the first 1024 x 1024 fully-connected layer with the TT-layer (solid lines) and with the matrix rank 
decomposition based layer (dashed line). The solid lines of different colors correspond to different 
ways of reshaping the input and output vectors to tensors (the shapes are reported in the legend). To 
obtain the points of the plots we vary the maximal TT-rank or the matrix rank. 

(again) dynamic programming yields us all the necesary matrices for summation (|9]). The overall 
computational complexity of the backward pass is 0{cP m max{M, N}). 

The presented algorithm reduces to a sequence of matrix-by-matrix products and permutations of 
dimensions and thus can be accelerated on a GPU device. 

6 Experiments 

6.1 Parameters of the TT-layer 

In this experiment we investigate the properties of the TT-layer and compare different strategies for 
setting its parameters: dimensions of the tensors representing the input/output of the layer and the 
TT-ranks of the compressed weight matrix. We run the experiment on the MNIST dataset ifTsIl for 
the task of handwritten-digit recognition. As a baseline we use a neural network with two fully- 
connected layers (1024 hidden units) and rectified linear unit (ReLU) achieving 1.9% error on the 
test set. For more reshaping options we resize the original 28 x 28 images to 32 x 32. 

We train several networks differing in the parameters of the single TT-layer. The networks contain 
the following layers: the TT-layer with weight matrix of size 1024 x 1024, ReLU, the fully-connected 
layer with the weight matrix of size 1024 x 10. We test different ways of reshaping the input/output 
tensors and try different ranks of the TT-layer. As a simple compression baseline in the place of 
the TT-layer we use the fully-connected layer such that the rank of the weight matrix is bounded 
(implemented as follows: the two consecutive fully-connected layers with weight matrices of sizes 
1024 X r and r x 1024, where r controls the matrix rank and the compression factor). The results of 
the experiment are shown in Figure[T] We conclude that the TT-ranks provide much better flexibility 
than the matrix rank when applied at the same compression level. In addition, we observe that the 
TT-layers with too small number of values for each tensor dimension and with too few dimensions 
perform worse than their more balanced counterparts. 

Comparison with HashedNet 0]. We consider a two-layered neural network with 1024 hidden 
units and replace both fully-connected layers by the TT-layers. By setting all the TT-ranks in the 
network to 8 we achieved the test error of 1.6% with 12 602 parameters in total and by setting all 
the TT-ranks to 6 the test eri'or of 1.9% with 7 698 parameters. Chen et al. 0] report results on the 
same architecture. By tying random subsets of weights they compressed the network by the factor 
of 64 to the 12 720 parameters in total with the test error equal 2.79%. 

6.2 CIFAR-10 

CIFAR-10 dataset |[T^ consists of 32 x 32 3-channel images assigned to 10 different classes: air¬ 
plane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. The dataset contains 50000 train and 
10000 test images. Following ffol] we preprocess the images by subtracting the mean and perform¬ 
ing global contrast normalization and ZCA whitening. 

As a baseline we use the CIFAR-10 Quick CNN, which consists of convolutional, pooling and 
non-linearity layers followed by two fully-connected layers of sizes 1024 x 64 and 64 x 10. We fix 
the convolutional part of the network and substitute the fully-connected part by a 1024 x N TT-layer 
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Architecture 

TT-layers 

compr. 

vgg-16 

compr. 

vgg-19 

compr. 

vgg-16 
top 1 

vgg-16 
top 5 

vgg-19 
top 1 

vgg-19 
top 5 

FC FC FC 

1 

1 

1 

30.9 

11.2 

29.0 

10.1 

TT4 FC FC 

50 972 

3.9 

3.5 

31.2 

11.2 

29.8 

10.4 

TT2 FC FC 

194 622 

3.9 

3.5 

31.5 

11.5 

30.4 

10.9 

TTlFC FC 

713 614 

3.9 

3.5 

33.3 

12.8 

31.9 

11.8 

TT4 TT4 FC 

37 732 

7.4 

6 

32.2 

12.3 

31.6 

11.7 

MRl FC FC 

3 521 

3.9 

3.5 

99.5 

97.6 

99.8 

99 

MR5 FC FC 

704 

3.9 

3.5 

81.7 

53.9 

79.1 

52.4 

MR50 FC FC 

70 

3.7 

3.4 

36.7 

14.9 

34.5 

15.8 


Table 2: Substituting the fully-connected layers with the TT-layers in vgg-16 and vgg-19 networks 
on the ImageNet dataset. FC stands for a fully-connected layer; TTD stands for a TT-layer with 
all the TT-ranks equal MRD stands for a fully-connected layer with the matrix rank restricted 
to We report the compression rate of the TT-layers matrices and of the whole network in the 
second, third and fourth columns. 

followed by ReLU and by a TV x 10 fully-connected layer. With N = 3125 hidden units (contrary 
to 64 in the original network) we achieve the test error of 23.13% without fine-tuning which is 
slightly better than the test error of the baseline (23.25%). The TT-layer treated input and output 
vectors as4x4x4x4x4 and 5x5x5x5x5 tensors respectively. All the TT-ranks equal 
8, making the number of the parameters in the TT-layer equal 4 160. The compression rate of the 
TensorNet compared with the baseline w.r.t. all the parameters is 1.24. In addition, substituting the 
both fully-connected layers by the TT-layers yields the test error of 24.39% and reduces the number 
of parameters of the fully-connected layer matrices by the factor of 11.9 and the total parameter 
number by the factor of 1.7. 

For comparison, in 0 the fully-connected layers in a CIFAR-10 CNN were compressed by the 
factor of at most 4.7 times with the loss of about 2% in accuracy. 

6.2.1 Wide and shallow network 

With sufficient amount of hidden units, even a neural network with two fully-connected layers and 
sigmoid non-linearity can approximate any decision boundary fH]. Traditionally, very wide shallow 
networks are not considered because of high computational and memory demands and the over¬ 
fitting risk. TensorNet can potentially address both issues. We use a three-layered TensorNet of 
the following architecture; the TT-layer with the weight matrix of size 3 072 x 262 144, ReLU, the 
TT-layer with the weight matrix of size 262 144 x 4 096, ReLU, the fully-connected layer with the 
weight matrix of size 4 096 x 10. We report the test error of 31.47% which is (to the best of our 
knowledge) the best result achieved by a non-convolutional neural network. 

6.3 ImageNet 

In this experiment we evaluate the TT-layers on a large scale task. We consider the 1000-class 
ImageNet ILSVRC-2012 dataset ifT^ . which consist of 1.2 million training images and 50 000 
validation images. We use deep the CNNs vgg-16 and vgg-19 flUl as the reference model^ Both 
networks consist of the two parts: the convolutional and the fully-connected parts. In the both 
networks the second part consist of 3 fully-connected layers with weight matrices of sizes 25088 x 
4096, 4096 x 4096 and 4096 x 1000. 

In each network we substitute the first fully-connected layer with the TT-layer. To do this we reshape 
the 25088-dimensional input vectors to the tensors of the size 2x7x8x8x7x4 and the 4096- 
dimensional output vectors to the tensors of the size 4x4x4x4x4x4. The remaining fully- 
connected layers are initialized randoml y. T he parameters of the convolutional parts are kept fixed 
as trained by Simonyan and Zisserman Il21ll . We train the TT-layer and the fully-connected layers 
on the training set. In Table |2]we vary the ranks of the TT-layer and report the compression factor of 
the TT-layers (vs. the original fully-connected layer), the resulting compression factor of the whole 
network, and the top 1 and top 5 errors on the validation set. In addition, we substitute the second 
fully-connected layer with the TT-layer. As a baseline compression method we constrain the matrix 
rank of the weight matrix of the first fully-connected layer using the approach of ^. 

^ After we had started to experiment on the vgg-16 network the vgg-* networks have been improved by 
the authors. Thus, we report the results on a slightly outdated version of vgg-16 and the up-to-date version of 
vgg-19. 
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Type 

1 im. time (ms) 

100 im. time (ms) 

CPU fully-connected layer 

16.1 

97.2 

CPU TT-layer 

1.2 

94.7 

GPU fully-connected layer 

2.7 

33 

GPU TT-layer 

1.9 

12.9 


Table 3: Inference time for a 25088 x 4096 fully-connected layer and its cotTesponding TT-layer 
with all the TT-ranks equal 4. The memory usage for feeding forward one image is 392MB for the 
fully-connected layer and 0.766MB for the TT-layer. 

In Table |2] we observe that the TT-layer in the best case manages to reduce the number of the 
parameters in the matrix W of the largest fully-connected layer by a factor of 194 622 (from 25088 x 
4096 parameters to 528) while increasing the top 5 etTor from 11.2 to 11.5. The compression 
factor of the whole network remains at the level of 3.9 because the TT-layer stops being the storage 
bottleneck. By compressing the largest of the remaining layers the compression factor goes up 
to 7.4. The baseline method when providing similar compression rates significantly increases the 
error. 

For comparison, consider the results of ll^ obtained for the compression of the fully-connected lay¬ 
ers of the Krizhevsky-type network |[T^ with the Fastfood method. The model achieves compression 
factors of 2-3 without decreasing the network error. 

6.4 Implementation details 

In all experiments we use our MATLAB extensiorQof the MatConvNet frameworkQ ll24l] . For the 
operations related to the TT-format we use the TT-Toolbo?(3 implemented in MATLAB as well. The 
experiments were performed on a computer with a quad-core Intel Core 15-4460 CPU, 16 GB RAM 
and a single NVidia Geforce GTX 980 GPU. We report the running times and the memory usage at 
the forward pass of the TT-layer and the baseline fully-connected layer in Table |3] 

We train all the networks with stochastic gradient descent with momentum (coefficient 0.9). We 
initialize all the parameters of the TT- and fully-connected layers with a Gaussian noise and put 
L2-regularization (weight 0.0005) on them. 

7 Discussion and future work 

Recent studies indicate high redundancy in the current neural network parametrization. To exploit 
this redundancy we propose to use the TT-decomposition framework on the weight matrix of a 
fully-connected layer and to use the cores of the decomposition as the parameters of the layer. This 
allows us to train the fully-connected layers compressed by up to 200 000 x compared with the 
explicit parametrization without significant error increase. Our experiments show that it is possible 
to capture complex dependencies within the data by using much more compact representations. On 
the other hand it becomes possible to use much wider layers than was available before and the 
preliminary experiments on the CIFAR-10 dataset show that wide and shallow TensorNets achieve 
promising results (setting new state-of-the-art for non-convolutional neural networks). 

Another appealing property of the TT-layer is faster inference time (compared with the correspond¬ 
ing fully-connected layer). All in all a wide and shallow TensorNet can become a time and memory 
efficient model to use in real time applications and on mobile devices. 

The main limiting factor for an M x N fully-connected layer size is its parameters number MN. 
The limiting factor for anM xN TT-layer is the maximal linear size max{M, N}. Asa future work 
we plan to consider the inputs and outputs of layers in the TT-format thus completely eliminating 
the dependency on M and N and allowing billions of hidden units in a TT-layer. 
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’https://github.com/Bihaqo/TensorNet 
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